-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updating SimpleGermlineTagger and somatic CNV experimental post-processing workflow #5252
Conversation
Prototype WDL and first cut at a workflow. Making the MergeAnnotatedRegionsByAnnotation tool spit out a GATK seg file. More automated testing A little more automated testing. Adding a simple WDL to convert IGV and concatenate. Lots of TODO addressed. More WDL changes. More WDL changes. Adding more docs Adding more docs More documentation updates. More documentation updates. More documentation updates. More documentation updates. Some more last minute doc updates.
760ea8c
to
3513e38
Compare
@samuelklee As a reminder, everything here is unsupported and/or experimental. |
Codecov Report
@@ Coverage Diff @@
## master #5252 +/- ##
===============================================
- Coverage 86.757% 80.631% -6.127%
+ Complexity 29763 28463 -1300
===============================================
Files 1825 1830 +5
Lines 137699 138368 +669
Branches 15176 15237 +61
===============================================
- Hits 119464 111567 -7897
- Misses 12721 21438 +8717
+ Partials 5514 5363 -151
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding this. A few minor questions/comments on the WDL and Docker images. Will leave it up to you if you want to keep the IGV stuff, but I think it's worth cleaning up now. Didn't look at the Java too closely, but I'll trust that everything is OK there. Back to you, @LeeTL1220!
@@ -5,7 +5,7 @@ | |||
# - The intervals argument is required for both WGS and WES workflows and accepts formats compatible with the | |||
# GATK -L argument (see https://gatkforums.broadinstitute.org/gatk/discussion/11009/intervals-and-interval-lists). | |||
# These intervals will be padded on both sides by the amount specified by PreprocessIntervals.padding (default 250) | |||
# and split into bins of length specified by PreprocessIntervals.bin_length (default 1000; specify 0 to skip binning, | |||
# and split into bins of length specified by bin_length (default 1000; specify 0 to skip binning, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. Can you do a find-and-replace on other instances? Should be a few more in the other somatic/germline WDLs as well as the docs. Also may be some instances of PreprocessIntervals.padding
.
BTW, where are we with Cromwell exposing task-level parameters automatically for subworkflows? Do we still need to expose all parameters in this way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done and done for padding as well.
Your comment here is related to broadinstitute/cromwell#2912 ... I have not heard about any update on this.
# Unsupported workflow that concatenates the IGV compatible files generated by multiple runs of combine_tracks.wdl | ||
workflow AggregateCombinedTracksWorkflow { | ||
String group_id | ||
Array[File] tumor_with_germline_pruned_segs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so I understand, this is the seg file with segments that have been tagged as germline removed? If so, then I think "filtered" is better than "pruned", since the latter suggests a tree structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tumor_with_germline_filtered_segs
is the new name.
Done
|
||
task TsvCat { | ||
|
||
String id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is some white space funkiness here and throughout the other WDLs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed it here. Will be on lookout in other files.
|
||
for FILE in ${sep=" " input_files} | ||
do | ||
egrep -v "CONTIG|Chromo" $FILE >> ${id}.aggregated.seg |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just go ahead and put Chromosome
to avoid any confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
# | ||
# UNSUPPORTED and we may modify this WDL before we fully support it. | ||
# | ||
# This will also generate absolute (https://software.broadinstitute.org/cancer/cga/absolute) compatible files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
absolute -> ABSOLUTE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
>>> | ||
|
||
runtime { | ||
docker: "continuumio/anaconda" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This repo seems a little more legit, but perhaps we should still use our own GCR image if appropriate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, I do not want to put our GCR image into anything that is seen outside the Broad.
No action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be worth just exposing the Docker images as parameters, then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, but only on the tasks. These will run fine with the GATK docker image (I have tested), so the workflow will feed that docker image.
More versatile if broadinstitute/cromwell#2912 gets addressed.
sep='\t', comment='@', na_values='NA') | ||
model_segments_af_param_pd = pd.read_csv(model_segments_af_param_input_file, sep='\t', comment='@') | ||
|
||
def simple_determine_allelic_fraction(model_segments_seg_pd): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened to the beta fitting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found that it did very little on the test data that I had on hand. I do not yet have a good evaluation set up for the ABSOLUTE conversion, which I would like to do before making more changes. We can discuss offline -- I think I have an approach.
No action.
final Map<AnnotatedInterval,CalledCopyRatioSegment.Call> result = new HashMap<>(); | ||
for (final AnnotatedInterval normalSeg : nonZeroMergedNormalSegmentsToTumorSegments.keySet()) { | ||
final List<AnnotatedInterval> overlappingTumorSegments = nonZeroMergedNormalSegmentsToTumorSegments.get(normalSeg); | ||
|
||
final List<AnnotatedInterval> mergedTumorSegments = mergedRegionsByAnnotation(callAnnotation, overlappingTumorSegments); | ||
final boolean isReciprocolOverlapSeen = mergedTumorSegments.stream() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reciprocol -> Reciprocal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
import java.util.stream.Collectors; | ||
|
||
public class MergeAnnotatedRegionsByAnnotationIntegrationTest extends CommandLineProgramTest { | ||
private static final String TEST_SUB_DIR = publicTestDir + "org/broadinstitute/hellbender/tools/copynumber/utils/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use toolsTestDir
to clean this up a bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
final List<AnnotatedInterval> mergedTumorSegments = mergedRegionsByAnnotation(callAnnotation, overlappingTumorSegments); | ||
final boolean isReciprocolOverlapSeen = mergedTumorSegments.stream() | ||
.anyMatch(s -> (normalSeg.getInterval().intersect(s).size() > (s.getInterval().size() * reciprocalThreshold)) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably cleaner to extract a test for reciprocal overlap, but up to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done... Added utility to IntervalUtils
Thanks for adding that method. But shouldn’t the threshold be in [0, 1]? |
@samuelklee I added the range check in reciprocal overlap method. Please take a look at the rest, as I did not directly address all comments (I filed issues). |
@samuelklee Back to you... tell me if I am okay to merge. |
@samuelklee The repo requires an explicit approval of the review. |
…ssing workflow (broadinstitute#5252) Several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV: - `combine_tracks.wdl` for post-processing somatic CNV calls. This wdl will perform two operations: - Increase precision by removing: - germline segments. As a result, the WDL requires the matched normal segments. - Areas of common germline activity or error from other cancer studies. - Convert the tumor model seg file to the same format as AllelicCapSeg, which can be read by ABSOLUTE. This is currently done inline in the WDL. - This is not a trivial conversion, since each segment must be called whether it is balanced or not (MAF =? 0.5). The current algorithm relies on hard filtering and may need updating pending evaluation. - For more information about AllelicCapSeg and ABSOLUTE, see: - Carter et al. *Absolute quantification of somatic DNA alterations in human cancer*, Nat Biotechnol. 2012 May; 30(5): 413–421 - https://software.broadinstitute.org/cancer/cga/absolute - Brastianos, P.K., Carter S.L., et al. *Genomic Characterization of Brain Metastases Reveals Branched Evolution and Potential Therapeutic Targets* (2015) Cancer Discovery PMID:26410082 - Changes to GATK tools to support the above: - `SimpleGermlineTagger` now uses reciprocal overlap to in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres. - Added tool `MergeAnnotatedRegionsByAnnotation`. This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance. - `multi_combine_tracks.wdl` and `aggregate_combine_tracks.wdl` which run `combine_tracks.wdl` on multiple pairs and combine the results into one seg file for easy consumption by IGV.
Several experimental changes that improve precision results, and expand possible evaluations, of GATK CNV:
combine_tracks.wdl
for post-processing somatic CNV calls. This wdl will perform two operations:Changes to GATK tools to support the above:
SimpleGermlineTagger
now uses reciprocal overlap to in addition to breakpoint matching when determining a possible germline event. This greatly improved results in areas near centromeres.MergeAnnotatedRegionsByAnnotation
. This simple tool will merge genomic regions (specified in a tsv) when given annotations (columns) contain exact values in neighboring segments and the segments are within a specified maximum genomic distance.multi_combine_tracks.wdl
andaggregate_combine_tracks.wdl
which runcombine_tracks.wdl
on multiple pairs and combine the results into one seg file for easy consumption by IGV.